Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.
Customers leaving the credit card service would mean a loss for the bank, so the bank wants to analyze customer data to identify the customers who are likely to leave its credit card services, and the reasons why, so that it can improve in those areas.
As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
This is a commented Jupyter IPython Notebook file in which all the instructions and tasks to be performed are mentioned.
# Installing the libraries with the specified version.
!pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==2.0.3 imbalanced-learn==0.10.1 xgboost==2.0.3 -q --user
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
#Libraries for reading and manipulating data
import pandas as pd
import numpy as np
#Libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
#For splitting data into training and testing sets
from sklearn.model_selection import train_test_split
#For imputing missing values
from sklearn.impute import SimpleImputer
#For generating different metric scores for the models
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
#For oversampling and undersampling data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
#For Randomized Search to do hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV
#For building models using decision tree and boosting techniques
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, BaggingClassifier
from xgboost import XGBClassifier
#Display all columns of a dataframe (no column limit)
pd.set_option("display.max_columns", None)
#Prevent scientific notation from being displayed in numerical columns
pd.set_option("display.float_format", lambda x: "%.3f" % x)
#Do not show warnings
import warnings
warnings.filterwarnings("ignore")
#Access Google Drive
from google.colab import drive
drive.mount('/content/drive')
#Read BankChurners.csv data
churn = pd.read_csv('/content/drive/MyDrive/Data/BankChurners.csv')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
#Create a copy of the churn dataset
data = churn.copy()
#View first 5 rows of data
data.head()
| | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
#View last 5 rows of data
data.tail()
| | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10122 | 772366833 | Existing Customer | 50 | M | 2 | Graduate | Single | $40K - $60K | Blue | 40 | 3 | 2 | 3 | 4003.000 | 1851 | 2152.000 | 0.703 | 15476 | 117 | 0.857 | 0.462 |
| 10123 | 710638233 | Attrited Customer | 41 | M | 2 | NaN | Divorced | $40K - $60K | Blue | 25 | 4 | 2 | 3 | 4277.000 | 2186 | 2091.000 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
| 10124 | 716506083 | Attrited Customer | 44 | F | 1 | High School | Married | Less than $40K | Blue | 36 | 5 | 3 | 4 | 5409.000 | 0 | 5409.000 | 0.819 | 10291 | 60 | 0.818 | 0.000 |
| 10125 | 717406983 | Attrited Customer | 30 | M | 2 | Graduate | NaN | $40K - $60K | Blue | 36 | 4 | 3 | 3 | 5281.000 | 0 | 5281.000 | 0.535 | 8395 | 62 | 0.722 | 0.000 |
| 10126 | 714337233 | Attrited Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Silver | 25 | 6 | 2 | 4 | 10388.000 | 1961 | 8427.000 | 0.703 | 10294 | 61 | 0.649 | 0.189 |
#Check shape of dataset
data.shape
(10127, 21)
#Get info of the dataset - view column names, data types, and see if there are any missing values
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CLIENTNUM                 10127 non-null  int64
 1   Attrition_Flag            10127 non-null  object
 2   Customer_Age              10127 non-null  int64
 3   Gender                    10127 non-null  object
 4   Dependent_count           10127 non-null  int64
 5   Education_Level           8608 non-null   object
 6   Marital_Status            9378 non-null   object
 7   Income_Category           10127 non-null  object
 8   Card_Category             10127 non-null  object
 9   Months_on_book            10127 non-null  int64
 10  Total_Relationship_Count  10127 non-null  int64
 11  Months_Inactive_12_mon    10127 non-null  int64
 12  Contacts_Count_12_mon     10127 non-null  int64
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64
 18  Total_Trans_Ct            10127 non-null  int64
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
#Find total number of missing values in the data
data.isnull().sum()
CLIENTNUM                      0
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category                0
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64
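The counts above correspond to roughly 15% missing for Education_Level (1519/10127) and 7.4% for Marital_Status (749/10127). A quick way to view missingness directly as percentages, sketched on a toy frame (the frame below is illustrative, not the bank's data):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value out of four in column 'a' (illustrative)
df = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0], 'b': [1, 2, 3, 4]})

# Fraction of nulls per column, scaled to a percentage
missing_pct = df.isnull().mean() * 100
print(missing_pct)  # a: 25.0, b: 0.0
```

The same `data.isnull().mean() * 100` expression works on the full churn dataframe.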
#Check for duplicate values
data.duplicated().sum()
0
#View summary statistics for the numerical columns
data.describe(include=['int64', 'float64']).T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CLIENTNUM | 10127.000 | 739177606.334 | 36903783.450 | 708082083.000 | 713036770.500 | 717926358.000 | 773143533.000 | 828343083.000 |
| Customer_Age | 10127.000 | 46.326 | 8.017 | 26.000 | 41.000 | 46.000 | 52.000 | 73.000 |
| Dependent_count | 10127.000 | 2.346 | 1.299 | 0.000 | 1.000 | 2.000 | 3.000 | 5.000 |
| Months_on_book | 10127.000 | 35.928 | 7.986 | 13.000 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.000 | 3.813 | 1.554 | 1.000 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.000 | 2.341 | 1.011 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.000 | 2.455 | 1.106 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.000 | 8631.954 | 9088.777 | 1438.300 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.000 | 1162.814 | 814.987 | 0.000 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.000 | 7469.140 | 9090.685 | 3.000 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.000 | 0.760 | 0.219 | 0.000 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.000 | 4404.086 | 3397.129 | 510.000 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.000 | 64.859 | 23.473 | 10.000 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.000 | 0.712 | 0.238 | 0.000 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.000 | 0.275 | 0.276 | 0.000 | 0.023 | 0.176 | 0.503 | 0.999 |
#View summary statistics for the object columns
data.describe(include=['object']).T
| | count | unique | top | freq |
|---|---|---|---|---|
| Attrition_Flag | 10127 | 2 | Existing Customer | 8500 |
| Gender | 10127 | 2 | F | 5358 |
| Education_Level | 8608 | 6 | Graduate | 3128 |
| Marital_Status | 9378 | 3 | Married | 4687 |
| Income_Category | 10127 | 6 | Less than $40K | 3561 |
| Card_Category | 10127 | 4 | Blue | 9436 |
#Get value counts for unique values in each categorical column (object type)
for column in data.columns:
if data[column].dtype == 'object':
print(data[column].value_counts(normalize=True))
print('-' * 50)
Attrition_Flag
Existing Customer   0.839
Attrited Customer   0.161
Name: proportion, dtype: float64
--------------------------------------------------
Gender
F   0.529
M   0.471
Name: proportion, dtype: float64
--------------------------------------------------
Education_Level
Graduate        0.363
High School     0.234
Uneducated      0.173
College         0.118
Post-Graduate   0.060
Doctorate       0.052
Name: proportion, dtype: float64
--------------------------------------------------
Marital_Status
Married    0.500
Single     0.420
Divorced   0.080
Name: proportion, dtype: float64
--------------------------------------------------
Income_Category
Less than $40K   0.352
$40K - $60K      0.177
$80K - $120K     0.152
$60K - $80K      0.138
abc              0.110
$120K +          0.072
Name: proportion, dtype: float64
--------------------------------------------------
Card_Category
Blue       0.932
Silver     0.055
Gold       0.011
Platinum   0.002
Name: proportion, dtype: float64
--------------------------------------------------
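Note the placeholder value "abc" in Income_Category (about 11% of rows): it is not a real income band, so it is best treated as missing. A minimal sketch of converting such a placeholder to NaN, on a toy series (the notebook applies the same `replace` to the real column later, before modeling):

```python
import numpy as np
import pandas as pd

# Toy income series containing the placeholder 'abc' (illustrative values)
s = pd.Series(['$40K - $60K', 'abc', 'Less than $40K', 'abc'])

# Treat the placeholder as missing so imputation can handle it
s = s.replace('abc', np.nan)
print(s.isnull().sum())  # → 2
```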
Questions:
- How does the change in transaction count from Q4 to Q1 (Total_Ct_Chng_Q4_Q1) vary by the customer's account status (Attrition_Flag)?
- How does the number of months inactive in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?
#Function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None, title_name=''):
#Set up boxplot and histogram subplots figure
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2,
sharex=True,
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
)
#Plot boxplot - show mean with triangle
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
)
#Plot histogram
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
)
#Indicate mean with green dashed line
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
)
#Indicate median with black solid line
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
)
plt.show()
#Function to plot distributions
def distribution_plot_wrt_target(data, predictor, target):
#Set figure size
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
#Get unique target values
target_uniq = data[target].unique()
#Plot histogram of predictor for the first target class (Existing Customers here)
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
)
#Plot histogram of predictor for the second target class (Attrited Customers here)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
)
#Plot boxplot (with outliers) of predictor vs target
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
#Plot boxplot (without outliers) of predictor vs target - w.r.t Attrition Flag
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
#Plot the histogram and boxplot of Total Transaction Amount column
histogram_boxplot(data, 'Total_Trans_Amt', figsize=(12, 7), kde=True)
#Get the mode, mean, and median values
data['Total_Trans_Amt'].mode(), data['Total_Trans_Amt'].mean(), data['Total_Trans_Amt'].median()
(0 4253 1 4509 Name: Total_Trans_Amt, dtype: int64, 4404.086303939963, 3899.0)
#Get value counts for Total Transaction Amount
data['Total_Trans_Amt'].value_counts()
Total_Trans_Amt
4253 11
4509 11
4518 10
2229 10
4220 9
..
1274 1
4521 1
3231 1
4394 1
10294 1
Name: count, Length: 5033, dtype: int64
#Get value counts for Level of Education
data['Education_Level'].value_counts(normalize=True)
Education_Level
Graduate        0.363
High School     0.234
Uneducated      0.173
College         0.118
Post-Graduate   0.060
Doctorate       0.052
Name: proportion, dtype: float64
#Count Plot for Level of Education Distribution
sns.countplot(data=data, x='Education_Level', hue='Education_Level', palette='viridis')
#Include title
plt.title('Education Level Distribution')
#Label x and y axes
plt.xlabel('Education Level')
plt.ylabel('Customer Count')
plt.show()
#Count Plot for Level of Income Distribution
#Set figure size
plt.figure(figsize=(10, 6))
sns.countplot(data=data, x='Income_Category', hue='Income_Category', palette='viridis')
#Include title
plt.title('Income Level Distribution')
#Label x and y axes
plt.xlabel('Income Level')
plt.ylabel('Customer Count')
plt.show()
#Get value counts for Income Category
data['Income_Category'].value_counts(normalize=True)
Income_Category
Less than $40K   0.352
$40K - $60K      0.177
$80K - $120K     0.152
$60K - $80K      0.138
abc              0.110
$120K +          0.072
Name: proportion, dtype: float64
#Plot Total Amount Change (Ratio of Q4 / Q1) by Attrition Flag
distribution_plot_wrt_target(data, 'Total_Amt_Chng_Q4_Q1', 'Attrition_Flag')
#Plot distribution of Months Inactive in past 12 months by Attrition Flag
distribution_plot_wrt_target(data, 'Months_Inactive_12_mon', 'Attrition_Flag')
#Create a copy of churn dataframe
data_corr = churn.copy()
#Fill missing values in Education Level, Marital Status, and Income Category with their respective modes
#(impute before any astype(str) conversion - converting first would turn NaN into the literal string 'nan', and fillna would then have no effect)
data_corr['Education_Level'] = data_corr['Education_Level'].fillna(data_corr['Education_Level'].mode()[0])
data_corr['Marital_Status'] = data_corr['Marital_Status'].fillna(data_corr['Marital_Status'].mode()[0])
data_corr['Income_Category'] = data_corr['Income_Category'].fillna(data_corr['Income_Category'].mode()[0])
#Drop the Client Number column since it does not contribute to the analysis
data_corr = data_corr.drop('CLIENTNUM', axis=1)
#Encode Attrition Flag as integers - Existing Customer = 0, Attrited Customer = 1
data_corr['Attrition_Flag'] = data_corr['Attrition_Flag'].map({'Existing Customer': 0, 'Attrited Customer': 1})
#One-hot encode the categorical columns, dropping the first level of each
features = pd.get_dummies(data_corr, columns=['Gender', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category'], drop_first=True)
#Create correlation matrix with numerical columns
numerical_corr_matrix = data_corr.drop(['Attrition_Flag', 'Gender', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category'], axis=1).corr()
#Set figure size
plt.figure(figsize=(15, 8))
#Plot heatmap
sns.heatmap(numerical_corr_matrix, annot=True, cmap='coolwarm')
#Include title
plt.title('Correlation Matrix: Numerical Columns')
plt.show()
#Plot histogram and box plot for Customer Age
histogram_boxplot(data, 'Customer_Age', figsize=(12, 7), kde=True)
#Get mode, mean, and median
data['Customer_Age'].mode(), data['Customer_Age'].mean(), data['Customer_Age'].median()
(0 44 Name: Customer_Age, dtype: int64, 46.32596030413745, 46.0)
#Plot histogram and box plot of Months on Book
histogram_boxplot(data, 'Months_on_book', figsize=(12, 7), kde=True)
#Get mode, mean, and median
data['Months_on_book'].mode(), data['Months_on_book'].mean(), data['Months_on_book'].median()
(0 36 Name: Months_on_book, dtype: int64, 35.928409203120374, 36.0)
#Plot histogram and box plot of Total Relationship Count
histogram_boxplot(data, 'Total_Relationship_Count', figsize=(12, 7), kde=True)
#Get mode, mean, and median
data['Total_Relationship_Count'].mode(), data['Total_Relationship_Count'].mean(), data['Total_Relationship_Count'].median()
(0 3 Name: Total_Relationship_Count, dtype: int64, 3.8125802310654686, 4.0)
#Plot histogram and box plot of Months Inactive - past 12 months
histogram_boxplot(data, 'Months_Inactive_12_mon', figsize=(12, 7), kde=True)
#Get mode, mean, and median
data['Months_Inactive_12_mon'].mode(), data['Months_Inactive_12_mon'].mean(), data['Months_Inactive_12_mon'].median()
(0 3 Name: Months_Inactive_12_mon, dtype: int64, 2.3411671768539546, 2.0)
#Plot histogram and box plot of Contacts Count - past 12 months
histogram_boxplot(data, 'Contacts_Count_12_mon', figsize=(12, 7), kde=True)
#Get mode, mean, and median
data['Contacts_Count_12_mon'].mode(), data['Contacts_Count_12_mon'].mean(), data['Contacts_Count_12_mon'].median()
(0 3 Name: Contacts_Count_12_mon, dtype: int64, 2.4553174681544387, 2.0)
#Plot histogram and box plot of Credit Limit
histogram_boxplot(data, 'Credit_Limit', figsize=(12, 7), kde=True)
#Get mode, mean, and median
data['Credit_Limit'].mode(), data['Credit_Limit'].mean(), data['Credit_Limit'].median()
(0 34516.000 Name: Credit_Limit, dtype: float64, 8631.953698034955, 4549.0)
#Plot histogram and box plot of Total Revolving Balance
histogram_boxplot(data, 'Total_Revolving_Bal', figsize=(12, 7), kde=True)
#Get mode, mean, and median
data['Total_Revolving_Bal'].mode(), data['Total_Revolving_Bal'].mean(), data['Total_Revolving_Bal'].median()
(0 0 Name: Total_Revolving_Bal, dtype: int64, 1162.8140614199665, 1276.0)
#Plot Average Open To Buy
histogram_boxplot(data, 'Avg_Open_To_Buy', figsize=(12, 7), kde=True)
#Get mode, mean, and median
data['Avg_Open_To_Buy'].mode(), data['Avg_Open_To_Buy'].mean(), data['Avg_Open_To_Buy'].median()
(0 1438.300 Name: Avg_Open_To_Buy, dtype: float64, 7469.139636614989, 3474.0)
#Plot histogram and box plot of Average Utilization Ratio
histogram_boxplot(data, 'Avg_Utilization_Ratio', figsize=(12, 7), kde=True)
#Get mode, mean, and median
data['Avg_Utilization_Ratio'].mode(), data['Avg_Utilization_Ratio'].mean(), data['Avg_Utilization_Ratio'].median()
(0 0.000 Name: Avg_Utilization_Ratio, dtype: float64, 0.2748935518909845, 0.176)
#Plot histogram and box plot of Total Transaction Count
histogram_boxplot(data, 'Total_Trans_Ct', figsize=(12, 7), kde=True)
#Get mode, mean, and median
data['Total_Trans_Ct'].mode(), data['Total_Trans_Ct'].mean(), data['Total_Trans_Ct'].median()
(0 81 Name: Total_Trans_Ct, dtype: int64, 64.85869457884863, 67.0)
#Plot histogram and box plot of Total Transaction Amount
histogram_boxplot(data, 'Total_Trans_Amt', figsize=(12, 7), kde=True)
#Get mode, mean, and median
data['Total_Trans_Amt'].mode(), data['Total_Trans_Amt'].mean(), data['Total_Trans_Amt'].median()
(0 4253 1 4509 Name: Total_Trans_Amt, dtype: int64, 4404.086303939963, 3899.0)
#Plot histogram and box plot of Total Count Change - Q4 to Q1 Ratio
histogram_boxplot(data, 'Total_Ct_Chng_Q4_Q1', figsize=(12, 7), kde=True)
#Get mode, mean, and median
data['Total_Ct_Chng_Q4_Q1'].mode(), data['Total_Ct_Chng_Q4_Q1'].mean(), data['Total_Ct_Chng_Q4_Q1'].median()
(0 0.667 Name: Total_Ct_Chng_Q4_Q1, dtype: float64, 0.7122223758269972, 0.702)
#Plot histogram and box plot of Total Amount Change - Q4 to Q1 Ratio
histogram_boxplot(data, 'Total_Amt_Chng_Q4_Q1', figsize=(12, 7), kde=True)
#Get mode, mean, and median
data['Total_Amt_Chng_Q4_Q1'].mode(), data['Total_Amt_Chng_Q4_Q1'].mean(), data['Total_Amt_Chng_Q4_Q1'].median()
(0 0.791 Name: Total_Amt_Chng_Q4_Q1, dtype: float64, 0.7599406536980349, 0.736)
#Plot histogram and box plot of Dependent Count
histogram_boxplot(data, 'Dependent_count', figsize=(12, 7), kde=True)
#Get mode, mean, and median
data['Dependent_count'].mode(), data['Dependent_count'].mean(), data['Dependent_count'].median()
(0 3 Name: Dependent_count, dtype: int64, 2.3462032191172115, 2.0)
#Set figure size
plt.figure(figsize=(10, 6))
#Count Plot for Card Category
sns.countplot(data=data, x='Card_Category', hue='Card_Category')
#Include title
plt.title('Card Category Distribution')
#Label x and y axes
plt.xlabel('Card Category')
plt.ylabel('Customer Count')
plt.show()
#Get value counts for each card type
data['Card_Category'].value_counts(normalize=True)
Card_Category
Blue       0.932
Silver     0.055
Gold       0.011
Platinum   0.002
Name: proportion, dtype: float64
#Set figure size
plt.figure(figsize=(10, 6))
#Count Plot for Gender
sns.countplot(data=data, x='Gender', hue='Gender')
#Include title
plt.title('Gender Distribution')
#Label x and y axes
plt.xlabel('Gender')
plt.ylabel('Customer Count')
Text(0, 0.5, 'Customer Count')
#Get value counts for Gender
data['Gender'].value_counts(normalize=True)
Gender
F   0.529
M   0.471
Name: proportion, dtype: float64
#Make a copy of original dataframe
status = churn.copy()
#Get the mode for marital status to impute missing values
marital_status_mode = status['Marital_Status'].mode()[0]
#Impute missing values with the mode of marital status
status['Marital_Status'] = status['Marital_Status'].fillna(marital_status_mode)
#Set figure size
plt.figure(figsize=(10, 6))
#Count Plot for Marital Status
sns.countplot(data=status, x='Marital_Status', hue='Marital_Status')
#Include title
plt.title('Marital Status Distribution')
#Label x and y axes
plt.xlabel('Marital Status')
plt.ylabel('Customer Count')
plt.show()
#Get value counts excluding missing values
data[~data['Marital_Status'].isnull()]['Marital_Status'].value_counts(normalize=True)
Marital_Status
Married    0.500
Single     0.420
Divorced   0.080
Name: proportion, dtype: float64
#Get value counts after imputation
status['Marital_Status'].value_counts(normalize=True)
Marital_Status
Married    0.537
Single     0.389
Divorced   0.074
Name: proportion, dtype: float64
#Set figure size
plt.figure(figsize=(10, 6))
#Count Plot for Attrition Flag
sns.countplot(data=data, x='Attrition_Flag', hue='Attrition_Flag')
#Include title
plt.title('Attrition Flag Distribution')
#Label x and y axes
plt.xlabel('Attrition Flag')
plt.ylabel('Customer Count')
plt.show()
#Get value counts for Attrition Flag
data['Attrition_Flag'].value_counts(normalize=True)
Attrition_Flag
Existing Customer   0.839
Attrited Customer   0.161
Name: proportion, dtype: float64
#Histogram and box plots of Customer Age by Attrition Flag
distribution_plot_wrt_target(data, 'Customer_Age', 'Attrition_Flag')
#Set figure size
plt.figure(figsize=(10, 6))
#Count Plot for Gender by Attrition Flag
sns.countplot(data=data, x='Gender', hue='Attrition_Flag')
#Include title
plt.title('Gender by Attrition Flag')
#Label x and y axes
plt.xlabel('Gender')
plt.ylabel('Customer Count')
plt.show()
#Gender count by Attrition Flag
data.groupby('Gender')['Attrition_Flag'].value_counts(normalize=True)
Gender Attrition_Flag
F Existing Customer 0.826
Attrited Customer 0.174
M Existing Customer 0.854
Attrited Customer 0.146
Name: proportion, dtype: float64
#Histogram and box plots of Credit Limit by Attrition Flag
distribution_plot_wrt_target(data, 'Credit_Limit', 'Attrition_Flag')
#Set figure size
plt.figure(figsize=(10, 6))
#Count Plot for Education Level by Attrition Flag
sns.countplot(data=data, x='Education_Level', hue='Attrition_Flag')
#Include title
plt.title('Education Level by Attrition Flag')
#Label x and y axes
plt.xlabel('Education Level')
plt.ylabel('Customer Count')
plt.show()
#Get value counts for Education Level by Attrition Flag
data.groupby('Education_Level')['Attrition_Flag'].value_counts(normalize=True)
Education_Level Attrition_Flag
College Existing Customer 0.848
Attrited Customer 0.152
Doctorate Existing Customer 0.789
Attrited Customer 0.211
Graduate Existing Customer 0.844
Attrited Customer 0.156
High School Existing Customer 0.848
Attrited Customer 0.152
Post-Graduate Existing Customer 0.822
Attrited Customer 0.178
Uneducated Existing Customer 0.841
Attrited Customer 0.159
Name: proportion, dtype: float64
#Filter out customers who are missing a marital status
status = data[~data['Marital_Status'].isnull()]
#Set figure size
plt.figure(figsize=(10, 6))
#Count Plot for Marital Status by Attrition Flag
sns.countplot(data=status, x='Marital_Status', hue='Attrition_Flag')
#Include title
plt.title('Marital Status by Attrition Flag')
#Label x and y axes
plt.xlabel('Marital Status')
plt.ylabel('Customer Count')
plt.show()
#Get value counts for each marital status by attrition flag
data.groupby('Marital_Status')['Attrition_Flag'].value_counts(normalize=True)
Marital_Status Attrition_Flag
Divorced Existing Customer 0.838
Attrited Customer 0.162
Married Existing Customer 0.849
Attrited Customer 0.151
Single Existing Customer 0.831
Attrited Customer 0.169
Name: proportion, dtype: float64
#Histogram and box plots of Contacts Count by Attrition Flag
distribution_plot_wrt_target(data, 'Contacts_Count_12_mon', 'Attrition_Flag')
#Histogram and box plots of Average Utilization Ratio by Attrition Flag
distribution_plot_wrt_target(data, 'Avg_Utilization_Ratio', 'Attrition_Flag')
#Histogram and box plots of Total Revolving Balance by Attrition Flag
distribution_plot_wrt_target(data, 'Total_Revolving_Bal', 'Attrition_Flag')
#Histogram and box plots of Dependent Count by Attrition Flag
distribution_plot_wrt_target(data, 'Dependent_count', 'Attrition_Flag')
#Set figure size
plt.figure(figsize=(10, 6))
#Count Plot for Income Category by Attrition Flag
sns.countplot(data=data, x='Income_Category', hue='Attrition_Flag')
#Include title
plt.title('Income Category by Attrition Flag')
#Label x and y axes
plt.xlabel('Income Category')
plt.ylabel('Customer Count')
plt.show()
#Histogram and box plots of Months on Book by Attrition Flag
distribution_plot_wrt_target(data, 'Months_on_book', 'Attrition_Flag')
#Histogram and box plots of Total Relationship Count by Attrition Flag
distribution_plot_wrt_target(data, 'Total_Relationship_Count', 'Attrition_Flag')
#Histogram and box plots of Months Inactive by Attrition Flag
distribution_plot_wrt_target(data, 'Months_Inactive_12_mon', 'Attrition_Flag')
#Histogram and box plots of Average Open to Buy by Attrition Flag
distribution_plot_wrt_target(data, 'Avg_Open_To_Buy', 'Attrition_Flag')
#Histogram and box plots of Total Transaction Amount by Attrition Flag
distribution_plot_wrt_target(data, 'Total_Trans_Amt', 'Attrition_Flag')
#Histogram and box plots of Total Transaction Count by Attrition Flag
distribution_plot_wrt_target(data, 'Total_Trans_Ct', 'Attrition_Flag')
#Histogram and box plots of Total Transaction Count Change (Q4 / Q1) by Attrition Flag
distribution_plot_wrt_target(data, 'Total_Ct_Chng_Q4_Q1', 'Attrition_Flag')
#List of numerical features to check for outliers
numerical_features = ['Customer_Age', 'Dependent_count', 'Months_on_book',
'Total_Relationship_Count', 'Months_Inactive_12_mon','Contacts_Count_12_mon', 'Credit_Limit',
'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']
#Compute the number of outliers (if any) for each feature
for feature in numerical_features:
#Get the 1st and 3rd quartiles for each feature
Q1 = data[feature].quantile(0.25)
Q3 = data[feature].quantile(0.75)
#Find IQR
IQR = Q3 - Q1
#Get lower and upper fences
lower_fence = Q1 - 1.5*IQR
upper_fence = Q3 + 1.5*IQR
#Filter dataframe for records that are less than the lower fence or greater than the upper fence
#Get total number of outliers
outliers = data[(data[feature] < lower_fence) | (data[feature] > upper_fence)].shape[0]
#Print percentage
print(f'{feature} has {round((outliers/data.shape[0])*100, 2)}% outliers')
Customer_Age has 0.02% outliers
Dependent_count has 0.0% outliers
Months_on_book has 3.81% outliers
Total_Relationship_Count has 0.0% outliers
Months_Inactive_12_mon has 3.27% outliers
Contacts_Count_12_mon has 6.21% outliers
Credit_Limit has 9.72% outliers
Total_Revolving_Bal has 0.0% outliers
Avg_Open_To_Buy has 9.51% outliers
Total_Amt_Chng_Q4_Q1 has 3.91% outliers
Total_Trans_Amt has 8.85% outliers
Total_Trans_Ct has 0.02% outliers
Total_Ct_Chng_Q4_Q1 has 3.89% outliers
Avg_Utilization_Ratio has 0.0% outliers
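The fence arithmetic in the loop above can be sanity-checked on a tiny series (toy values, not bank data):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])  # 100 is a clear outlier
# Quartiles with pandas' default linear interpolation: Q1 = 2, Q3 = 4
Q1, Q3 = s.quantile(0.25), s.quantile(0.75)
IQR = Q3 - Q1                                   # 2
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR   # -1 and 7
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # → [100]
```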
#Convert Gender to Female = 0, Male = 1
data['Gender'] = data['Gender'].map({'F': 0, 'M': 1})
#Convert Attrition Flag to numerical data type - Existing Customer = 0, Attrited Customer = 1
data['Attrition_Flag'] = data['Attrition_Flag'].map({'Existing Customer': 0, 'Attrited Customer': 1})
#Replace "abc" with NAs
data['Income_Category'] = data['Income_Category'].replace('abc', np.nan)
#Set features dataframe
X = data.drop(['CLIENTNUM', 'Attrition_Flag'], axis=1)
#Set target dataframe
y = data['Attrition_Flag']
#Replace True with 1 and False with 0
X = X.replace({True: 1, False: 0})
#Split features and target into temporary and testing sets - temporary 80% / testing 20%
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
#Check shape of temporary and testing sets
print(f'Temporary set shape: {X_temp.shape}')
print(f'Testing set shape: {X_test.shape}')
Temporary set shape: (8101, 19)
Testing set shape: (2026, 19)
#Split temporary into training and validation sets - 75% for training and 25% for validation
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp)
#Check shape of training and validation sets
print(f'Training set shape: {X_train.shape}')
print(f'Validation set shape: {X_val.shape}')
Training set shape: (6075, 19)
Validation set shape: (2026, 19)
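Because stratify is passed at both splits, each of the training, validation, and test sets keeps the roughly 84/16 class ratio. This behavior can be checked on synthetic labels (the toy arrays below are illustrative, not the bank's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic samples with a 84/16 class split, mimicking the churn ratio
y_demo = np.array([0] * 84 + [1] * 16)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)
# The 20-sample test split carries 3 or 4 positives (about 16%)
print(len(y_te), int(y_te.sum()))
```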
#Columns to impute
cols_to_impute = ['Education_Level', 'Marital_Status', 'Income_Category']
#Impute the missing values in Education Level, Marital Status, and Income Category
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
#Fit and transform the imputer on training set
X_train[cols_to_impute] = imp_mode.fit_transform(X_train[cols_to_impute])
#Transform the imputer on validation set
X_val[cols_to_impute] = imp_mode.transform(X_val[cols_to_impute])
#Transform the imputer on testing set
X_test[cols_to_impute] = imp_mode.transform(X_test[cols_to_impute])
#Encode categorical features with get_dummies
X_train = pd.get_dummies(data=X_train, drop_first=True)
X_val = pd.get_dummies(data=X_val, drop_first=True)
X_test = pd.get_dummies(data=X_test, drop_first=True)
#Check shape of training, validation, and testing sets
print(f'Training set shape: {X_train.shape}')
print(f'Validation set shape: {X_val.shape}')
print(f'Testing set shape: {X_test.shape}')
Training set shape: (6075, 29)
Validation set shape: (2026, 29)
Testing set shape: (2026, 29)
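The shapes match here, but one caveat with calling pd.get_dummies separately on each split: if a rare category (e.g. a Platinum card) were absent from the validation or test split, the column sets would differ. A hedged sketch of guarding against this by aligning the splits with DataFrame.reindex (the toy frames below are illustrative):

```python
import pandas as pd

# Toy splits where category 'SF' is missing from the second split
train = pd.DataFrame({'city': ['NY', 'LA', 'SF']})
test = pd.DataFrame({'city': ['NY', 'LA']})

train_d = pd.get_dummies(train, drop_first=True)  # columns: city_NY, city_SF
test_d = pd.get_dummies(test, drop_first=True)    # columns: city_NY only

# Align the test columns to the training columns, filling absent dummies with 0
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_d.columns))
```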
# Function to calculate different metrics and check performance of classification model
def model_performance_classification_sklearn(model, predictors, target):
    #Predict using the independent variables
    pred = model.predict(predictors)
    #Accuracy score
    acc = accuracy_score(target, pred)
    #Recall score
    recall = recall_score(target, pred)
    #Precision score
    precision = precision_score(target, pred)
    #F1 score
    f1 = f1_score(target, pred)
    #Create metrics dataframe
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1
        },
        index=[0],
    )
    return df_perf
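A hand-checkable sanity test of the metrics used by the helper above: with three true positives, three true negatives, and one error of each kind, all four scores come out to 0.75.

```python
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]  # TP=3, TN=3, FP=1, FN=1

print(accuracy_score(y_true, y_pred))   # (3+3)/8 = 0.75
print(recall_score(y_true, y_pred))     # 3/(3+1) = 0.75
print(precision_score(y_true, y_pred))  # 3/(3+1) = 0.75
print(f1_score(y_true, y_pred))         # 0.75
```

Recall is the metric optimized throughout this notebook, since a missed attriting customer (false negative) costs the bank more than a false alarm.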
#Classifiers compared below (not imported in the setup cell above)
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
#Empty list to store all the models
models = []
#Appending models into the list
models.append(("Decision tree", DecisionTreeClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("XGBoost", XGBClassifier(random_state=1)))
#Get recall scores for each model on training set
print('Training Performance:', '\n')
for name, model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_train)
    score = recall_score(y_train, y_pred)
    print(f'{name}: {score}')
print('-'*50)
#Get recall scores for each model on validation set
print('Validation Performance:', '\n')
for name, model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    score_val = recall_score(y_val, y_pred)
    print(f'{name}: {score_val}')
Training Performance:
Decision tree: 1.0
Bagging: 0.9846311475409836
Random forest: 1.0
AdaBoost: 0.826844262295082
XGBoost: 1.0
--------------------------------------------------
Validation Performance:
Decision tree: 0.8159509202453987
Bagging: 0.8006134969325154
Random forest: 0.803680981595092
AdaBoost: 0.852760736196319
XGBoost: 0.901840490797546
#Empty list to store all the oversampled models
models_over = []
#Appending models into the list
models_over.append(("Decision tree", DecisionTreeClassifier(random_state=1)))
models_over.append(("Bagging", BaggingClassifier(random_state=1)))
models_over.append(("Random forest", RandomForestClassifier(random_state=1)))
models_over.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models_over.append(("XGBoost", XGBClassifier(random_state=1)))
#Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
#Training count after oversampling
print(f'Oversampling - X_Train Count: {X_train_over.shape[0]}')
print(f'Oversampling - y_Train Count: {y_train_over.shape[0]}')
print('-'*35)
#Label count after oversampling
print(f'Oversampling - Attrited Label Count: {sum(y_train_over == 1)}')
print(f'Oversampling - Existing Label Count: {sum(y_train_over == 0)}')
Oversampling - X_Train Count: 10198
Oversampling - y_Train Count: 10198
-----------------------------------
Oversampling - Attrited Label Count: 5099
Oversampling - Existing Label Count: 5099
#Get recall scores for each model on the oversampled training set
print('Training Performance:', '\n')
for name, model in models_over:
    model.fit(X_train_over, y_train_over)
    y_pred = model.predict(X_train_over)
    score = recall_score(y_train_over, y_pred)
    print(f'{name}: {score}')
print('-'*50)
#Get recall scores for each model on validation set
print('Validation Performance:', '\n')
for name, model in models_over:
    model.fit(X_train_over, y_train_over)
    y_pred = model.predict(X_val)
    score_val = recall_score(y_val, y_pred)
    print(f'{name}: {score_val}')
Training Performance:
Decision tree: 1.0
Bagging: 0.9978427142576975
Random forest: 1.0
AdaBoost: 0.9635222592665228
XGBoost: 1.0
--------------------------------------------------
Validation Performance:
Decision tree: 0.8404907975460123
Bagging: 0.8926380368098159
Random forest: 0.8742331288343558
AdaBoost: 0.8895705521472392
XGBoost: 0.911042944785276
#Empty list to store all the models
models_under = []
#Appending models into the list
models_under.append(("Decision tree", DecisionTreeClassifier(random_state=1)))
models_under.append(("Bagging", BaggingClassifier(random_state=1)))
models_under.append(("Random forest", RandomForestClassifier(random_state=1)))
models_under.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models_under.append(("XGBoost", XGBClassifier(random_state=1)))
#Random undersampler for undersampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_under, y_train_under = rus.fit_resample(X_train, y_train)
#Training count after undersampling
print(f'Undersampling - X_Train Count: {X_train_under.shape[0]}')
print(f'Undersampling - y_Train Count: {y_train_under.shape[0]}')
print('-'*35)
#Label count after undersampling
print(f'Undersampling - Attrited Label Count: {sum(y_train_under == 1)}')
print(f'Undersampling - Existing Label Count: {sum(y_train_under == 0)}')
Undersampling - X_Train Count: 1952
Undersampling - y_Train Count: 1952
-----------------------------------
Undersampling - Attrited Label Count: 976
Undersampling - Existing Label Count: 976
#Get recall scores for each model on the undersampled training set
print('Training Performance:', '\n')
for name, model in models_under:
    model.fit(X_train_under, y_train_under)
    y_pred = model.predict(X_train_under)
    score = recall_score(y_train_under, y_pred)
    print(f'{name}: {score}')
print('-'*50)
#Get recall scores for each model on validation set
print('Validation Performance:', '\n')
for name, model in models_under:
    model.fit(X_train_under, y_train_under)
    y_pred = model.predict(X_val)
    score_val = recall_score(y_val, y_pred)
    print(f'{name}: {score_val}')
Training Performance:
Decision tree: 1.0
Bagging: 0.9918032786885246
Random forest: 1.0
AdaBoost: 0.9528688524590164
XGBoost: 1.0
--------------------------------------------------
Validation Performance:
Decision tree: 0.8865030674846626
Bagging: 0.9294478527607362
Random forest: 0.9355828220858896
AdaBoost: 0.9601226993865031
XGBoost: 0.9693251533742331
#Gather validation recall scores (as percentages) for a summary dataframe
results = {'Model': ['Decision Tree', 'Bagging', 'Random Forest', 'AdaBoost', 'XGBoost'],
           'Original': [81.6, 80.1, 80.4, 85.3, 90.2],
           'Oversampling': [84.0, 89.3, 87.4, 88.9, 91.1],
           'Undersampling': [88.7, 92.9, 93.6, 96.0, 96.9]}
#Create dataframe of validation recall scores
val_df = pd.DataFrame(results, columns=['Model', 'Original', 'Oversampling', 'Undersampling'])
#Print dataframe
print('Validation: Recall Scores')
val_df
Validation: Recall Scores
| | Model | Original | Oversampling | Undersampling |
|---|---|---|---|---|
| 0 | Decision Tree | 81.600 | 84.000 | 88.700 |
| 1 | Bagging | 80.100 | 89.300 | 92.900 |
| 2 | Random Forest | 80.400 | 87.400 | 93.600 |
| 3 | AdaBoost | 85.300 | 88.900 | 96.000 |
| 4 | XGBoost | 90.200 | 91.100 | 96.900 |
#Define the model
model = XGBClassifier(random_state=1)
#Define the parameter grid to pass into Randomized Search
param_grid={'n_estimators':np.arange(50,110,25),
'scale_pos_weight':[1,2,5],
'learning_rate':[0.01,0.05,0.1],
'gamma':[1,3],
'subsample':[0.6,0.8]
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring='recall', cv=5, random_state=1)
#Fitting parameters on original training set in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
#Print results
print(f'Best parameters are {randomized_cv.best_params_} with CV score = {randomized_cv.best_score_}:')
Best parameters are {'subsample': 0.6, 'scale_pos_weight': 5, 'n_estimators': 100, 'learning_rate': 0.05, 'gamma': 3} with CV score = 0.9190423861852434:
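One way to read the selected `scale_pos_weight=5`: the common heuristic for this parameter is the ratio of negative to positive training labels. Using the label counts reported by the resampling cells (5099 existing, 976 attrited in the training set), that ratio is about 5.2, so the search landed close to the conventional value. A quick computation under that assumption:

```python
# Label counts of the (original) training set, taken from the resampling cells above
n_existing, n_attrited = 5099, 976
ratio = n_existing / n_attrited
print(round(ratio, 1))  # 5.2
```

`scale_pos_weight` up-weights the gradient contribution of positive (attrited) samples during boosting, pushing the model toward higher recall on the minority class.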
#Define tuned xgboost model
xgboost_tuned_original = XGBClassifier(n_estimators=100, learning_rate=0.05, gamma=3, subsample=0.6, scale_pos_weight=5)
#Fit tuned model on training set
xgboost_tuned_original.fit(X_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=3, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.05, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=100, n_jobs=None,
              num_parallel_tree=None, random_state=None, ...)
#Check model performance on training set
model_performance_classification_sklearn(xgboost_tuned_original, X_train, y_train)
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.979 | 0.996 | 0.886 | 0.938 |
#Check model performance on validation set
model_performance_classification_sklearn(xgboost_tuned_original, X_val, y_val)
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.960 | 0.942 | 0.832 | 0.883 |
#Define the model
model = XGBClassifier(random_state=1)
#Parameter grid to pass into RandomizedSearchCV
param_grid={'n_estimators':np.arange(50,110,25),
'scale_pos_weight':[1,2,5],
'learning_rate':[0.01,0.05,0.1],
'gamma':[1,3],
'subsample':[0.6,0.8]
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring='recall', cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
#Print results
print(f'Best parameters are {randomized_cv.best_params_} with CV score = {randomized_cv.best_score_}:')
Best parameters are {'subsample': 0.6, 'scale_pos_weight': 5, 'n_estimators': 100, 'learning_rate': 0.05, 'gamma': 3} with CV score = 0.9858823529411765:
#Define tuned xgboost oversampled model
xgboost_tuned_oversampled = XGBClassifier(n_estimators=100, learning_rate=0.05, gamma=3, subsample=0.6, scale_pos_weight=5)
#Fit tuned model on oversampled training set
xgboost_tuned_oversampled.fit(X_train_over, y_train_over)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=3, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.05, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=100, n_jobs=None,
              num_parallel_tree=None, random_state=None, ...)
#Check performance on oversampled training set
model_performance_classification_sklearn(xgboost_tuned_oversampled, X_train_over, y_train_over)
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.974 | 0.999 | 0.952 | 0.975 |
#Check performance on validation set
model_performance_classification_sklearn(xgboost_tuned_oversampled, X_val, y_val)
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.936 | 0.957 | 0.731 | 0.829 |
#Define model
model = BaggingClassifier(random_state=1)
#Define parameter grid for Randomized Search
param_grid = {
'n_estimators': [25,50,75],
'max_features': [0.2,0.4,0.7]
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring='recall', cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
#Print results
print(f'Best parameters are {randomized_cv.best_params_} with CV score = {randomized_cv.best_score_}:')
Best parameters are {'n_estimators': 75, 'max_features': 0.7} with CV score = 0.9790188381535145:
#Define tuned bagging oversampled model
bagging_tuned_oversampled = BaggingClassifier(n_estimators=75, max_features=0.7)
#Fit tuned model on oversampled training set
bagging_tuned_oversampled.fit(X_train_over, y_train_over)
BaggingClassifier(max_features=0.7, n_estimators=75)
#Check performance on oversampled training set
model_performance_classification_sklearn(bagging_tuned_oversampled, X_train_over, y_train_over)
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.000 | 1.000 | 1.000 | 1.000 |
#Check performance on validation set
model_performance_classification_sklearn(bagging_tuned_oversampled, X_val, y_val)
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.961 | 0.896 | 0.866 | 0.881 |
#Define the model
model = XGBClassifier(random_state=1)
#Parameter grid to pass into RandomizedSearchCV
param_grid={'n_estimators':np.arange(50,110,25),
'scale_pos_weight':[1,2,5],
'learning_rate':[0.01,0.05,0.1],
'gamma':[1,3],
'subsample':[0.6,0.8]
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring='recall', cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_under,y_train_under)
#Print results
print(f'Best parameters are {randomized_cv.best_params_} with CV score = {randomized_cv.best_score_}:')
Best parameters are {'subsample': 0.6, 'scale_pos_weight': 5, 'n_estimators': 100, 'learning_rate': 0.05, 'gamma': 3} with CV score = 0.9754264782836213:
#Define tuned xgboost undersampled model
xgboost_tuned_undersampled = XGBClassifier(n_estimators=100, learning_rate=0.05, gamma=3, subsample=0.6, scale_pos_weight=5)
#Fit tuned model on undersampled training set
xgboost_tuned_undersampled.fit(X_train_under, y_train_under)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=3, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.05, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=100, n_jobs=None,
              num_parallel_tree=None, random_state=None, ...)
#Check performance on undersampled training set
model_performance_classification_sklearn(xgboost_tuned_undersampled, X_train_under, y_train_under)
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.965 | 1.000 | 0.935 | 0.966 |
#Check performance on validation set
model_performance_classification_sklearn(xgboost_tuned_undersampled, X_val, y_val)
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.899 | 0.988 | 0.616 | 0.759 |
#Define the model
model = AdaBoostClassifier(random_state=1)
#Parameter grid to pass into RandomizedSearchCV
param_grid = {
'n_estimators': np.arange(50,110,25),
'learning_rate': [0.01,0.05,0.1],
'base_estimator': [DecisionTreeClassifier(max_depth=3, random_state=1), DecisionTreeClassifier(max_depth=4, random_state=1)],
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring='recall', cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_under,y_train_under)
#Print results
print(f'Best parameters are {randomized_cv.best_params_} with CV score={randomized_cv.best_score_}:')
Best parameters are {'n_estimators': 100, 'learning_rate': 0.05, 'base_estimator': DecisionTreeClassifier(max_depth=4, random_state=1)} with CV score=0.9549188906331765:
#Define tuned AdaBoost undersampled model
adaboost_tuned_undersampled = AdaBoostClassifier(n_estimators=100, learning_rate=0.05, base_estimator=DecisionTreeClassifier(max_depth=4, random_state=1))
#Fit tuned model on undersampled training set
adaboost_tuned_undersampled.fit(X_train_under, y_train_under)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=4,
                                                         random_state=1),
                   learning_rate=0.05, n_estimators=100)
#Check model performance on undersampled training set
model_performance_classification_sklearn(adaboost_tuned_undersampled, X_train_under, y_train_under)
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.997 | 1.000 | 0.994 | 0.997 |
#Check model performance on validation set
model_performance_classification_sklearn(adaboost_tuned_undersampled, X_val, y_val)
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.934 | 0.951 | 0.724 | 0.822 |
#Define the model
model = RandomForestClassifier(random_state=1)
#Parameter grid to pass into RandomizedSearchCV
param_grid = {
'n_estimators': [25,50,100],
'min_samples_leaf': np.arange(1, 5),
'max_features': list(np.arange(0.2, 0.7, 0.1)) + ['sqrt'],
'max_samples': np.arange(0.3, 0.7, 0.1)
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring='recall', cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_under,y_train_under)
#Print results
print(f'Best parameters are {randomized_cv.best_params_} with CV score={randomized_cv.best_score_}:')
Best parameters are {'n_estimators': 100, 'min_samples_leaf': 1, 'max_samples': 0.5, 'max_features': 'sqrt'} with CV score=0.9323652537938253:
#Define tuned random forest undersampled model
rf_tuned_undersampled = RandomForestClassifier(n_estimators=100, min_samples_leaf=1, max_samples=0.5, max_features='sqrt')
#Fit tuned model on undersampled training set
rf_tuned_undersampled.fit(X_train_under, y_train_under)
RandomForestClassifier(max_samples=0.5)
#Check performance on undersampled training set
model_performance_classification_sklearn(rf_tuned_undersampled, X_train_under, y_train_under)
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.995 | 0.998 | 0.992 | 0.995 |
#Check performance on validation set
model_performance_classification_sklearn(rf_tuned_undersampled, X_val, y_val)
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.924 | 0.914 | 0.704 | 0.796 |
#Get tuned models' performance dataframes for training
xgboost_tuned_training = model_performance_classification_sklearn(xgboost_tuned_original, X_train, y_train)
xgboost_tuned_oversampled_training = model_performance_classification_sklearn(xgboost_tuned_oversampled, X_train_over, y_train_over)
bagging_tuned_oversampled_training = model_performance_classification_sklearn(bagging_tuned_oversampled, X_train_over, y_train_over)
xgboost_tuned_undersampled_training = model_performance_classification_sklearn(xgboost_tuned_undersampled, X_train_under, y_train_under)
adaboost_tuned_undersampled_training = model_performance_classification_sklearn(adaboost_tuned_undersampled, X_train_under, y_train_under)
rf_tuned_undersampled_training = model_performance_classification_sklearn(rf_tuned_undersampled, X_train_under, y_train_under)
#Concatenate model performance results into a single training dataframe
training_df = pd.concat([xgboost_tuned_training, xgboost_tuned_oversampled_training, bagging_tuned_oversampled_training,
xgboost_tuned_undersampled_training, adaboost_tuned_undersampled_training, rf_tuned_undersampled_training])
#Set index to model names
training_df.index = ['XGBoost Tuned Original', 'XGBoost Tuned Oversampled', 'Bagging Tuned Oversampled', 'XGBoost Tuned Undersampled', 'AdaBoost Tuned Undersampled', 'Random Forest Tuned Undersampled']
#Print dataframe
training_df.T
| | XGBoost Tuned Original | XGBoost Tuned Oversampled | Bagging Tuned Oversampled | XGBoost Tuned Undersampled | AdaBoost Tuned Undersampled | Random Forest Tuned Undersampled |
|---|---|---|---|---|---|---|
| Accuracy | 0.979 | 0.974 | 1.000 | 0.965 | 0.997 | 0.995 |
| Recall | 0.996 | 0.999 | 1.000 | 1.000 | 1.000 | 0.998 |
| Precision | 0.886 | 0.952 | 1.000 | 0.935 | 0.994 | 0.992 |
| F1 | 0.938 | 0.975 | 1.000 | 0.966 | 0.997 | 0.995 |
#Get tuned models' performance for validation
xgboost_tuned_validation = model_performance_classification_sklearn(xgboost_tuned_original, X_val, y_val)
xgboost_tuned_oversampled_validation = model_performance_classification_sklearn(xgboost_tuned_oversampled, X_val, y_val)
bagging_tuned_oversampled_validation = model_performance_classification_sklearn(bagging_tuned_oversampled, X_val, y_val)
xgboost_tuned_undersampled_validation = model_performance_classification_sklearn(xgboost_tuned_undersampled, X_val, y_val)
adaboost_tuned_undersampled_validation = model_performance_classification_sklearn(adaboost_tuned_undersampled, X_val, y_val)
rf_tuned_undersampled_validation = model_performance_classification_sklearn(rf_tuned_undersampled, X_val, y_val)
#Concatenate model performance results into a single validation dataframe
validation_df = pd.concat([xgboost_tuned_validation, xgboost_tuned_oversampled_validation, bagging_tuned_oversampled_validation,
xgboost_tuned_undersampled_validation, adaboost_tuned_undersampled_validation, rf_tuned_undersampled_validation])
#Set index to model names
validation_df.index = ['XGBoost Tuned Original', 'XGBoost Tuned Oversampled', 'Bagging Tuned Oversampled', 'XGBoost Tuned Undersampled', 'AdaBoost Tuned Undersampled', 'Random Forest Tuned Undersampled']
#Print dataframe
validation_df.T
| | XGBoost Tuned Original | XGBoost Tuned Oversampled | Bagging Tuned Oversampled | XGBoost Tuned Undersampled | AdaBoost Tuned Undersampled | Random Forest Tuned Undersampled |
|---|---|---|---|---|---|---|
| Accuracy | 0.960 | 0.936 | 0.961 | 0.899 | 0.934 | 0.924 |
| Recall | 0.942 | 0.957 | 0.896 | 0.988 | 0.951 | 0.914 |
| Precision | 0.832 | 0.731 | 0.866 | 0.616 | 0.724 | 0.704 |
| F1 | 0.883 | 0.829 | 0.881 | 0.759 | 0.822 | 0.796 |
#Check performance on test set
model_performance_classification_sklearn(xgboost_tuned_original, X_test, y_test)
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.964 | 0.951 | 0.844 | 0.894 |
#Check performance on test set
model_performance_classification_sklearn(xgboost_tuned_oversampled, X_test, y_test)
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.942 | 0.978 | 0.741 | 0.844 |
#Check performance on test set
model_performance_classification_sklearn(bagging_tuned_oversampled, X_test, y_test)
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.962 | 0.908 | 0.865 | 0.886 |
#Check performance on test set
model_performance_classification_sklearn(xgboost_tuned_undersampled, X_test, y_test)
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.882 | 0.997 | 0.575 | 0.730 |
#Check performance on test set
model_performance_classification_sklearn(adaboost_tuned_undersampled, X_test, y_test)
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.941 | 0.969 | 0.743 | 0.841 |
#Check performance on test set
model_performance_classification_sklearn(rf_tuned_undersampled, X_test, y_test)
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.924 | 0.954 | 0.692 | 0.802 |
#Combine testing results in a single dataframe
#Get testing dataframes for each tuned model
xgboost_tuned_original_testing = model_performance_classification_sklearn(xgboost_tuned_original, X_test, y_test)
xgboost_tuned_oversampled_testing = model_performance_classification_sklearn(xgboost_tuned_oversampled, X_test, y_test)
bagging_tuned_oversampled_testing = model_performance_classification_sklearn(bagging_tuned_oversampled, X_test, y_test)
xgboost_tuned_undersampled_testing = model_performance_classification_sklearn(xgboost_tuned_undersampled, X_test, y_test)
adaboost_tuned_undersampled_testing = model_performance_classification_sklearn(adaboost_tuned_undersampled, X_test, y_test)
rf_tuned_undersampled_testing = model_performance_classification_sklearn(rf_tuned_undersampled, X_test, y_test)
#Concatenate dataframes into a single testing dataframe
testing_df = pd.concat([xgboost_tuned_original_testing, xgboost_tuned_oversampled_testing, bagging_tuned_oversampled_testing,
xgboost_tuned_undersampled_testing, adaboost_tuned_undersampled_testing, rf_tuned_undersampled_testing])
#Set index to model names
testing_df.index = ['XGBoost Tuned Original', 'XGBoost Tuned Oversampled', 'Bagging Tuned Oversampled', 'XGBoost Tuned Undersampled', 'AdaBoost Tuned Undersampled', 'Random Forest Tuned Undersampled']
#Print training dataframe again for side-by-side comparison with the testing dataframe
training_df.T
| | XGBoost Tuned Original | XGBoost Tuned Oversampled | Bagging Tuned Oversampled | XGBoost Tuned Undersampled | AdaBoost Tuned Undersampled | Random Forest Tuned Undersampled |
|---|---|---|---|---|---|---|
| Accuracy | 0.979 | 0.974 | 1.000 | 0.965 | 0.997 | 0.995 |
| Recall | 0.996 | 0.999 | 1.000 | 1.000 | 1.000 | 0.998 |
| Precision | 0.886 | 0.952 | 1.000 | 0.935 | 0.994 | 0.992 |
| F1 | 0.938 | 0.975 | 1.000 | 0.966 | 0.997 | 0.995 |
#Print testing dataframe
testing_df.T
| | XGBoost Tuned Original | XGBoost Tuned Oversampled | Bagging Tuned Oversampled | XGBoost Tuned Undersampled | AdaBoost Tuned Undersampled | Random Forest Tuned Undersampled |
|---|---|---|---|---|---|---|
| Accuracy | 0.964 | 0.942 | 0.962 | 0.882 | 0.941 | 0.924 |
| Recall | 0.951 | 0.978 | 0.908 | 0.997 | 0.969 | 0.954 |
| Precision | 0.844 | 0.741 | 0.865 | 0.575 | 0.743 | 0.692 |
| F1 | 0.894 | 0.844 | 0.886 | 0.730 | 0.841 | 0.802 |
#Feature names
feature_names = X_train.columns
#Get feature importances and sort indices
importances = xgboost_tuned_original.feature_importances_
indices = np.argsort(importances)
#Get y ticks
y_ticks = [feature_names[i] for i in indices]
#Get indices
index = range(len(indices))
#Set figure size
plt.figure(figsize=(8, 8))
#Plot Feature Importance bar graph
plt.barh(index, importances[indices], color='blue', align='center')
#Include title
plt.title('Feature Importances')
#Label x and y axes
plt.xlabel('Relative Importance')
plt.yticks(index, y_ticks)
plt.show()
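The same information also reads well as a sorted table. Here is a self-contained sketch (synthetic data and a small random forest standing in for the tuned XGBoost model) of turning `feature_importances_` into a pandas Series:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the fitted model and training frame above
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=1)
cols = [f'feat_{i}' for i in range(5)]

model = RandomForestClassifier(random_state=1).fit(X_demo, y_demo)
# Impurity-based importances, normalized to sum to 1, largest first
imp = pd.Series(model.feature_importances_, index=cols).sort_values(ascending=False)
print(imp)
```

With the notebook's own objects, the equivalent would pair `xgboost_tuned_original.feature_importances_` with `X_train.columns`.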
#Get mean total transaction count by attrition flag
churn.groupby('Attrition_Flag')['Total_Trans_Ct'].mean()
Attrition_Flag
Attrited Customer    44.934
Existing Customer    68.673
Name: Total_Trans_Ct, dtype: float64
#Get mean total transaction count by attrition flag and gender
churn.groupby(['Attrition_Flag', 'Gender'])['Total_Trans_Ct'].mean()
Attrition_Flag Gender
Attrited Customer F 44.052
M 46.110
Existing Customer F 71.036
M 66.102
Name: Total_Trans_Ct, dtype: float64
#Get mean total revolving balance by attrition flag
churn.groupby('Attrition_Flag')['Total_Revolving_Bal'].mean()
Attrition_Flag
Attrited Customer     672.823
Existing Customer    1256.604
Name: Total_Revolving_Bal, dtype: float64
#Get mean total revolving balance by attrition flag and gender
churn.groupby(['Attrition_Flag', 'Gender'])['Total_Revolving_Bal'].mean()
Attrition_Flag Gender
Attrited Customer F 667.208
M 680.316
Existing Customer F 1239.313
M 1275.407
Name: Total_Revolving_Bal, dtype: float64
#Get mean total transaction amount by attrition flag
churn.groupby('Attrition_Flag')['Total_Trans_Amt'].mean()
Attrition_Flag
Attrited Customer    3095.026
Existing Customer    4654.656
Name: Total_Trans_Amt, dtype: float64
#Get mean total transaction amount by attrition flag and gender
churn.groupby(['Attrition_Flag', 'Gender'])['Total_Trans_Amt'].mean()
Attrition_Flag Gender
Attrited Customer F 2784.184
M 3509.779
Existing Customer F 4647.788
M 4662.124
Name: Total_Trans_Amt, dtype: float64
#Get mean total relationship count by attrition flag
churn.groupby('Attrition_Flag')['Total_Relationship_Count'].mean()
Attrition_Flag
Attrited Customer    3.280
Existing Customer    3.915
Name: Total_Relationship_Count, dtype: float64
#Get mean total relationship count by attrition flag and gender
churn.groupby(['Attrition_Flag', 'Gender'])['Total_Relationship_Count'].mean()
Attrition_Flag Gender
Attrited Customer F 3.399
M 3.121
Existing Customer F 3.894
M 3.937
Name: Total_Relationship_Count, dtype: float64
#Get mean total count change Q4-Q1 by attrition flag
churn.groupby('Attrition_Flag')['Total_Ct_Chng_Q4_Q1'].mean()
Attrition_Flag
Attrited Customer    0.554
Existing Customer    0.742
Name: Total_Ct_Chng_Q4_Q1, dtype: float64
#Filter for attrited customers
attrited_customers = churn[churn['Attrition_Flag']=='Attrited Customer']
#Get the distribution of months inactive (last 12 months) among attrited customers
attrited_customers['Months_Inactive_12_mon'].value_counts(normalize=True).sort_index()
Months_Inactive_12_mon
0    0.009
1    0.061
2    0.310
3    0.508
4    0.080
5    0.020
6    0.012
Name: proportion, dtype: float64
#Filter for existing customers
existing_customers = churn[churn['Attrition_Flag'] == 'Existing Customer']
#Filter existing customers with fewer than 45 transactions (past 12 months) who have been inactive between 2 and 4 months
inactive_low_transactions = existing_customers[(existing_customers['Total_Trans_Ct'] < 45) & ((existing_customers['Months_Inactive_12_mon'] >= 2) & (existing_customers['Months_Inactive_12_mon'] <= 4))]
#Find male and female count of this population
inactive_low_transactions.groupby('Gender')['Gender'].count()
Gender
F    402
M    718
Name: Gender, dtype: int64
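Expressed as shares of this at-risk segment (counts copied from the output above), roughly 36% are female and 64% male:

```python
# Gender counts of existing customers with <45 transactions and 2-4 inactive months
counts = {'F': 402, 'M': 718}
total = sum(counts.values())
shares = {g: round(n / total, 3) for g, n in counts.items()}
print(shares)  # {'F': 0.359, 'M': 0.641}
```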
Bank Churners dataset statistics/observations: